10. PPO Part 1: The Surrogate Function

Re-weighting the Policy Gradient

Suppose we are trying to update our current policy, \pi_{\theta'}. To do that, we need to estimate a gradient, g. But we only have trajectories generated by an older policy \pi_{\theta}. How do we compute the gradient then?

Mathematically, we can use importance sampling. The answer is just what the normal policy gradient would be, multiplied by a re-weighting factor P(\tau;\theta')/P(\tau;\theta):

g=\frac{P(\tau; \theta')}{P(\tau; \theta)}\sum_t \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)}R_t^{\rm future}
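For intuition, here is a minimal numerical sketch of importance sampling itself (not part of the lesson): we estimate an expectation under a "new" distribution using only samples drawn from an "old" one, by re-weighting each sample with the probability ratio. The two Gaussians and the test function below are arbitrary choices for illustration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Samples come from the "old" distribution N(0, 1)...
old_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# ...but we want E[f(x)] under the "new" distribution N(0.5, 1).
f = lambda x: x ** 2
weights = (norm.pdf(old_samples, loc=0.5, scale=1.0)
           / norm.pdf(old_samples, loc=0.0, scale=1.0))

is_estimate = np.mean(weights * f(old_samples))      # importance-sampled estimate
direct = np.mean(f(rng.normal(0.5, 1.0, 100_000)))   # direct Monte Carlo check
print(is_estimate, direct)                           # both are close to 1 + 0.5**2 = 1.25

The re-weighting factor P(\tau;\theta')/P(\tau;\theta) plays exactly the role of "weights" here, with trajectories in place of samples.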

We can rearrange this equation: the re-weighting factor is just the product of the policy probabilities at every step -- I've picked out the terms at time-step t below. We can cancel some terms, but we're still left with a product of the policies at the other time steps, denoted by "...".

g=\sum_t \frac{...\, \cancel{\pi_{\theta'}(a_t|s_t)} \,...} {...\,\pi_{\theta}(a_t|s_t)\,...} \, \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\cancel{\pi_{\theta'}(a_t|s_t)}}R_t^{\rm future}
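To see why only the policy terms survive, recall that the trajectory probability factors into the policy and the environment dynamics, and the dynamics (and initial-state distribution) do not depend on the policy parameters, so they cancel in the ratio:

P(\tau;\theta)=p(s_0)\prod_t \pi_{\theta}(a_t|s_t)\, p(s_{t+1}|s_t,a_t)
\frac{P(\tau;\theta')}{P(\tau;\theta)}=\prod_t \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}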

Can we simplify this expression further? This is where the "proximal" in proximal policy optimization comes in. If the old and current policies are close enough to each other, all the factors hidden inside the "..." will be pretty close to 1, and we can ignore them.

Then the equation simplifies to

g=\sum_t \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}R_t^{\rm future}

It looks very similar to the old policy gradient. In fact, if the current policy and the old policy are the same, we recover exactly the vanilla policy gradient. But remember, this expression is different, because we are comparing two different policies.
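To check that claim: setting \theta' = \theta turns the ratio back into the familiar likelihood-ratio form,

g\big|_{\theta'=\theta}=\sum_t \frac{\nabla_{\theta} \pi_{\theta}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}R_t^{\rm future}=\sum_t \nabla_{\theta}\log \pi_{\theta}(a_t|s_t)\,R_t^{\rm future}

which is exactly the vanilla policy gradient.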

The Surrogate Function

Now that we have an approximate form of the gradient, we can think of it as the gradient of a new object, called the surrogate function:

g=\nabla_{\theta'} L_{\rm sur}(\theta', \theta)
L_{\rm sur}(\theta', \theta)= \sum_t \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)}R_t^{\rm future}

So using this new gradient, we can perform gradient ascent to update our policy -- which can be thought of as directly maximizing the surrogate function.
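In code, the surrogate function is just the probability ratio times the future reward, with the old-policy probabilities treated as constants. Below is a minimal PyTorch sketch, not the lesson's implementation; new_probs, old_probs, future_rewards and optimizer are assumed placeholders.

import torch

def surrogate(new_probs, old_probs, future_rewards):
    # L_sur(theta', theta): probability ratio times future reward, summed over time.
    # new_probs      -- pi_{theta'}(a_t|s_t), computed by the current policy network
    #                   on the stored (s_t, a_t) pairs, so it carries gradients
    # old_probs      -- pi_{theta}(a_t|s_t), saved when the trajectory was collected
    # future_rewards -- R_t^future for each time step
    ratio = new_probs / old_probs.detach()   # the old policy is treated as a constant
    return torch.sum(ratio * future_rewards)

# Gradient ascent on L_sur is gradient descent on its negative:
# loss = -surrogate(new_probs, old_probs, future_rewards)
# loss.backward()
# optimizer.step()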

But there is still one important issue we haven't addressed. If we keep reusing old trajectories and updating our policy, at some point the new policy might become different enough from the old one that all the approximations we made become invalid.

We need to find a way to make sure this doesn't happen. Let's see how in part 2.